DRILL-5688: Add repeated map support to column accessors #887
- Includes the core JSON-like reader and writer interfaces and implementations.
- Includes changes to value vectors, DrillBuf and other low-level classes.
- Modifications to the row set abstraction (used for testing) for the changed accessors. Row sets also act as tests for the accessor classes, including a number of tests that test the classes used for testing. (Yes, somewhat recursive…)
- Changes to unit tests, and the unit test framework, required by the changes to the accessor and row set classes.
This set of commits provides performance tuning of the column writer implementation. Discussion during the code review identified a number of tuning opportunities that led to this work. The commits divide the files by layer:
The result is a very nice performance improvement relative to the original vector mutator methods. The writers now provide:
Not a bad outcome from a casual code review conversation!

Merge Vector Overflow into Column Writers

DRILL-5688 and DRILL-5657 originally separated the tasks of detecting and handling vector overflow. (Vector overflow occurs when we are about to write past the maximum buffer size.) The result was that each “set” operation needed to traverse two layers: the “column loader”, which handles overflow, and the “column writer”, which writes the value to a vector. Discussion identified an opportunity to combine these into a single layer. The revised column accessors in this PR achieve that: each accessor detects overflow, calls a handler to deal with the overflow, then continues the write operation into the (presumably) new vector. The “result set loader” (DRILL-5657) abstraction contains the logic to handle overflows. The original “row set” abstractions for tests don’t handle overflow; instead they throw an index out-of-bounds exception. (Tests seldom write enough data to trigger an overflow unless deliberately testing the overflow feature.)

Bypass the Vector Mutator

The big performance improvement was to avoid the overhead of the original vector mutator. In the original code, each write travels through many layers of code:
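Returning to the merged overflow handling described above: a minimal sketch of the detect–hand off–retry shape. Everything here is hypothetical (`OverflowAwareIntWriter` and `OverflowListener` are illustrative names, not Drill's actual API); a plain `ByteBuffer` stands in for vector memory.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the merged design: the writer itself detects
// overflow, asks a listener for a fresh buffer, then retries the write.
// Names and types are illustrative, not Drill's actual API.
public class OverflowAwareIntWriter {

  /** Supplies a replacement buffer when the current one is full. */
  public interface OverflowListener {
    ByteBuffer overflowed(OverflowAwareIntWriter writer);
  }

  private ByteBuffer buf;
  private final OverflowListener listener;
  private int overflowCount;

  public OverflowAwareIntWriter(ByteBuffer buf, OverflowListener listener) {
    this.buf = buf;
    this.listener = listener;
  }

  /** Single-layer write: check for overflow, roll over, then write. */
  public void setInt(int value) {
    if (buf.remaining() < Integer.BYTES) {
      overflowCount++;
      buf = listener.overflowed(this);   // presumably a new, empty buffer
    }
    buf.putInt(value);
  }

  public int overflowCount() { return overflowCount; }
  public int valuesInCurrentBuffer() { return buf.position() / Integer.BYTES; }
}
```

In the real design the listener role is played by the result set loader, which owns the rollover logic; the point of the sketch is only that the set operation traverses a single layer.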
The revised column writers bypass the first three layers and instead use the Netty platform-dependent methods to write to vector memory. In this, they follow the text reader implementation, which does something similar.

Specialized Offset Vector Writer

For array (repeated) and variable-width values, the original code is quite inefficient: it must do a direct memory read, and two writes, for each value. The revised writer code for variable-width fields does two writes. For array (repeated) values, the writer code does just one write per value and one write per row. (This is why the writer code is so much faster: the test writes five values per row, giving 15 direct memory I/Os in the original code but just 6 in the new writer code.)

New Highly-Optimized Writer Code

The unfortunate side effect of the optimization is that the writers are no longer simple. The previous version (which should be reviewed first) delegates work to the vector mutator. The new version provides an integrated set of code that:
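The offset-vector arithmetic above can be made concrete with a toy sketch (all names hypothetical). The trick is to cache the running offset in a Java field, so appending a value costs one simulated direct-memory write, and closing out a row costs one more; five values in a row gives the six writes counted above.

```java
// Illustrative sketch of the specialized offset-vector write pattern:
// the running offset lives in a Java field, so no direct-memory read is
// needed per value. Names are hypothetical, not Drill's classes.
public class OffsetVectorWriterSketch {

  private final int[] offsets;     // stands in for the offset vector's memory
  private int nextOffset;          // cached end-of-data position (heap, not vector)
  private int rowCount;
  private int directWrites;        // counts simulated direct-memory writes

  public OffsetVectorWriterSketch(int maxRows) {
    offsets = new int[maxRows + 1];
  }

  /** Append one value of the given width: one (simulated) data write. */
  public void appendValue(int width) {
    directWrites++;                // the data write for the value itself
    nextOffset += width;           // cached update; no vector memory I/O
  }

  /** Close out a row of a repeated type: one offset write for the row. */
  public void endRow() {
    offsets[++rowCount] = nextOffset;
    directWrites++;
  }

  public int directWrites() { return directWrites; }
  public int rowEnd(int row) { return offsets[row + 1]; }
}
```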
On the other hand, it became clear that the writers could be simplified. The previous version, because it worked with the mutators in each vector class, needed a separate writer for required, optional and repeated vectors. The new code, because it works directly with Netty’s platform-dependent layer, can use a single writer for all three data modes. Now, one might note that the single implementation has some very slight overhead:
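The mode-dependent overhead amounts to one comparison per value. A minimal sketch, with hypothetical names and plain Java arrays standing in for the data and “is set” vectors:

```java
// Hypothetical sketch of one writer serving required, optional and
// repeated modes: the only mode-dependent cost is a single null check
// on the "is set" array. Names are illustrative, not Drill's generated code.
public class UnifiedIntWriter {

  private final int[] values;      // stands in for the data vector
  private final boolean[] isSet;   // stands in for the bits vector; null for required mode
  private int index;

  public UnifiedIntWriter(int capacity, boolean nullable) {
    values = new int[capacity];
    isSet = nullable ? new boolean[capacity] : null;
  }

  public void setInt(int value) {
    values[index] = value;
    if (isSet != null) {           // the single extra comparison in required mode
      isSet[index] = true;
    }
    index++;
  }

  public int get(int i) { return values[i]; }
  public boolean isSet(int i) { return isSet == null || isSet[i]; }
}
```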
However, the cost is a single comparison, so it did not seem worth the effort to generate three complex classes just to avoid that comparison. (The performance test results seem to back up this assumption.) Because of the above, the …

Non-zeroing Vector Reallocation

The existing … The revised column writer code avoids these two performance penalties by …
Existing code continues to use the original implementation.

Collateral Code Cleanup

Because vector overflow is now integrated into the writers, there is no longer a need to throw the prior … DRILL-5517 added “size aware” methods to value vectors to support the earlier version of the writers, which used the vector mutators. Since that logic is now integrated into the writers themselves, this commit backs out those changes to tidy up the value vector code. Previous versions of the row set writers flattened maps for easier creation of test data in the … Retrofitted a couple of tests to use the … Split the … Earlier versions had a way to set every column type from an int value for testing; moved this code around to make use of the generic object-based set methods now available. Added a … Added a …

Readers

A logical question is: “If the column writers benefited from optimization, what about the column readers?” Writers are the focus of the “memory fragmentation” project, which will replace the low-level vector writers in various operators. For now, readers are only used in test code, which is not performance critical. We can apply the same technique to readers when they find themselves on the critical performance path.
The first cut at initializing the offset vector had a bug when the vector was not previously initialized. This commit fixes that problem.
Closing as this PR is now superseded by #914.
Restructures the existing "column accessor" code to adopt a JSON-like structure that works for all of Drill's data types. This PR focused on the "repeated map" vector. (Still to come is support for repeated lists, but they fit into the revised JSON structure.)
This PR has four commits that highlight different parts of the changes:
Accessors
The accessor structure is explained in the `package-info.java` files in the accessor packages. Basically, the structure is: …

The accessors add an “object” layer that represents any of the three types. So, a tuple is really a list of (name, object accessor) pairs, where the object accessor provides access to a scalar, an array or a tuple, as appropriate for each column.
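The object layer described above can be sketched in miniature. All names here (`ObjectWriter`, `ScalarWriter`, `TupleWriter`) are hypothetical stand-ins for the real accessor classes; the sketch only shows the shape: every column is wrapped in an object accessor of one of three kinds, and a tuple is an ordered set of (name, object accessor) pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the JSON-like object layer: each column is
// exposed through an ObjectWriter of one of three kinds, and a tuple
// is an ordered list of (name, object writer) pairs. Hypothetical names.
public class ObjectModelSketch {

  public enum ObjectType { SCALAR, ARRAY, TUPLE }

  public interface ObjectWriter {
    ObjectType type();
  }

  /** Leaf accessor: the only kind an app typically touches directly. */
  public static class ScalarWriter implements ObjectWriter {
    public Object value;
    public ObjectType type() { return ObjectType.SCALAR; }
  }

  /** Tuple accessor: models both a row and a (possibly repeated) map. */
  public static class TupleWriter implements ObjectWriter {
    public final Map<String, ObjectWriter> members = new LinkedHashMap<>();
    public ObjectType type() { return ObjectType.TUPLE; }
    public ObjectWriter column(String name) { return members.get(name); }
  }
}
```

Since maps are themselves tuples, a map column is simply a `TupleWriter` nested inside the row's `TupleWriter`, which is how the structure models arbitrary nesting.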
The structure appears complex (since it must model JSON). But an app using this code would use just the leaf scalar readers and writers. These classes currently access data via the value vector `Mutator` and `Accessor` classes. But the goal is to eventually access the Netty `PlatformDependent` methods directly, so that there is a single layer between the application and the call into direct memory. (Today there are multiple layers.)

There is quite a bit of code change here to provide the new structure. But the core functionality of reading and writing to vectors has not changed much. And this code has extensive unit tests, which should avoid the need to “mentally execute” each line of code.
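As a hedged illustration of that single-layer goal: Drill would call Netty's `PlatformDependent` methods on a raw address, but the sketch below uses a `java.nio` direct `ByteBuffer` instead so it stays self-contained. The class name and layout are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Stand-in for a single-layer writer: one address computation, then a
// direct write into off-heap memory. The real code would use Netty's
// PlatformDependent on a raw address; a direct ByteBuffer is used here
// so the sketch is self-contained. Names are hypothetical.
public class DirectIntWriter {
  private final ByteBuffer buf;

  public DirectIntWriter(int capacity) {
    buf = ByteBuffer.allocateDirect(capacity * Integer.BYTES)
                    .order(ByteOrder.LITTLE_ENDIAN);
  }

  /** One call between the application and direct memory. */
  public void setInt(int index, int value) {
    buf.putInt(index * Integer.BYTES, value);
  }

  public int getInt(int index) {
    return buf.getInt(index * Integer.BYTES);
  }
}
```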
Supporting Classes
A new `TupleMetadata` class is a superset of the existing `BatchSchema`, but also provides “classic” tuple-like access by position or name. Eventually, this will also hold additional information such as actual size and so on (information now laboriously rediscovered by the “record batch sizer”). Since the accessors use a “tuple” abstraction to model both rows and maps, the tuple metadata provides this same view. The top-most tuple models the row. Columns within the row can be maps, which have their own tuple schema, and so on.

`TupleNameSpace` moves locations (so it can be used in the vector package) but otherwise remains unchanged.

`DrillBuf` provides an experimental `putInt()` method that does bounds checking and sets a value, to minimize calls. This will probably move into the writer in a later PR.

This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along Decimal scale and precision. See `RepeatedValueVectors.java`.

`MaterializedField` changes to add an `isEquivalent()` method to compare two fields, ignoring internal (`$offset$`, `$bits$`, etc.) vectors.

Row Set Classes and Tests
The `RowSet` family of classes changed in response to the accessor changes. `RowSetBuilder` and `RowSetComparison` test tools added support for repeated maps. … `RowSetBuilder` into the accessors. `RowSetSchema` evolved to become the `TupleMetadata` mentioned above.

In the previous version, the row set classes had complex logic to figure out what kind of accessor to create for each vector. This became overly complex. In this version, the row set “parses” a vector container to create “storage” objects that represent tuples and columns. A column can itself be a tuple. (Note: there is no need to model lists, since lists are just vectors at this level of abstraction, so they need no special handling.)
With this change, accessor creation is a simple matter of walking a tree to assemble the JSON-structure.
This structure is also used to create a batch's vectors from a schema.
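That parse-then-walk design can be sketched as a small composite tree. All names here are hypothetical: a “storage” node is either a leaf column (backed by a vector) or a tuple of child nodes, and assembling the accessors is a plain recursive traversal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "storage" tree: a node is either a leaf
// column or a tuple of child nodes; building accessors (here, just
// collecting leaf names) is a simple recursive walk of that tree.
public class StorageTreeSketch {

  public abstract static class ColumnStorage {
    public final String name;
    protected ColumnStorage(String name) { this.name = name; }
  }

  /** A leaf column, backed by a single vector. */
  public static class LeafStorage extends ColumnStorage {
    public LeafStorage(String name) { super(name); }
  }

  /** A tuple: the row itself, or a map column with its own schema. */
  public static class TupleStorage extends ColumnStorage {
    public final List<ColumnStorage> columns = new ArrayList<>();
    public TupleStorage(String name) { super(name); }
  }

  /** Walk the tree, collecting leaf column names in depth-first order. */
  public static List<String> leafNames(ColumnStorage node) {
    List<String> out = new ArrayList<>();
    collect(node, out);
    return out;
  }

  private static void collect(ColumnStorage node, List<String> out) {
    if (node instanceof TupleStorage) {
      for (ColumnStorage child : ((TupleStorage) node).columns) {
        collect(child, out);
      }
    } else {
      out.add(node.name);
    }
  }
}
```

The same traversal shape serves both directions mentioned above: walking storage objects to create accessors, and walking a schema to create a batch's vectors.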
Other Changes
The last commit contains various other changes, mostly reflecting the changes above.